R Workshop: The Grammar of Data Wrangling with dplyr

Author

A Workshop Guide

1. Introduction: What is Data Wrangling?

Welcome! Before we can make beautiful plots or build complex models, we need our data in the right format, and raw data almost never arrives that way. It’s often messy, has too many columns, not enough columns, or is just plain confusing.

Data Wrangling (or data manipulation) is the process of cleaning, structuring, and enriching raw data into a desired format for better decision-making in less time.

Today, we’re going to learn the most popular and powerful tool for this in R: the dplyr package.

dplyr is beloved because it provides a “grammar” for data manipulation. It gives us a few simple verbs (functions) that we can combine in logical sentences to solve complex data problems.

2. The Problem: The Instagram Reels Conundrum

We’ve all been there: “I’ll just watch one…” and suddenly it’s 45 minutes later.

As data scientists, we’ve been asked to investigate a new synthetic dataset. Our goal is to understand the factors related to high Instagram Reels usage. We have data on 200 users, including:

  • Demographics (age, gender)
  • Screen time (Reels hours per week)
  • Self-reported wellness (stress, anxiety, sleep quality)
  • Personality metrics (extraversion, conscientiousness)

Our main questions are:

  1. What does a “high-usage” user look like?
  2. Is Reels usage related to personality?
  3. Does higher usage correlate with higher stress or anxiety?
  4. Are there differences between groups (e.g., by gender)?

Let’s use dplyr to “wrangle” this data and find some answers!

3. Setup: Loading Our Tools and Data

First, we need to load our tools. We’ll load the tidyverse, which is a collection of R packages (including dplyr) designed for data science. We’ll also load DiagrammeR for our visual mind maps.

# install.packages("tidyverse") # Run this line if you don't have it installed
# install.packages("DiagrammeR") # Run this line if you don't have it installed
library(tidyverse)
library(DiagrammeR)

Next, let’s create our synthetic dataset. (In a real-world scenario, you would load this from a CSV or database using read_csv()).

# We use set.seed() so that our "random" data is the same for everyone
# This makes our workshop reproducible!
set.seed(123)
n <- 200 # We'll create 200 users

# --- Create our base variables ---
age <- sample(18:60, n, replace = TRUE)
gender <- sample(c("Male", "Female", "Non-binary"), n, replace = TRUE, prob = c(0.45, 0.45, 0.1))
stress_level <- sample(1:10, n, replace = TRUE)
anxiety_level <- sample(1:10, n, replace = TRUE)
personality_extraversion <- sample(1:10, n, replace = TRUE)
personality_conscientiousness <- sample(1:10, n, replace = TRUE)
reported_sleep_quality <- sample(1:10, n, replace = TRUE)

# --- Create our "outcome" variable: Reels Usage ---
# We'll make it correlated with our other variables to make it interesting!
# Logic: More usage for younger, more extraverted, more anxious, poor sleepers
base_usage <- 10
reels_hours_per_week <- base_usage + 
                       (personality_extraversion * 0.5) + 
                       (anxiety_level * 0.4) - 
                       ((age - 18) * 0.2) - # Younger people use it more
                       (reported_sleep_quality * 0.3) + 
                       rnorm(n, mean = 0, sd = 3) # Add some random noise

# Clean up the data: no negative hours, and round to 1 decimal
reels_hours_per_week[reels_hours_per_week < 0] <- 0
reels_hours_per_week <- round(reels_hours_per_week, 1)

# --- Combine into our final dataset (a "tibble") ---
reels_data <- tibble::tibble(
  user_id = paste0("user_", 1:n),
  age, 
  gender, 
  reels_hours_per_week,
  stress_level, 
  anxiety_level,
  personality_extraversion, 
  personality_conscientiousness,
  reported_sleep_quality
)

# Let's take a look at our data!
print(reels_data)
# A tibble: 200 × 9
   user_id   age gender     reels_hours_per_week stress_level anxiety_level
   <chr>   <int> <chr>                     <dbl>        <int>         <int>
 1 user_1     48 Female                      8.2            5             7
 2 user_2     32 Male                       11.7            2             3
 3 user_3     31 Male                       10.3            2             7
 4 user_4     20 Male                       18.3            3            10
 5 user_5     59 Non-binary                  0              5             3
 6 user_6     60 Female                      0.3            6             6
 7 user_7     54 Female                      6.3           10             7
 8 user_8     31 Male                       15.4            3             4
 9 user_9     42 Male                        7.7            3            10
10 user_10    43 Female                      7.9            1             1
# ℹ 190 more rows
# ℹ 3 more variables: personality_extraversion <int>,
#   personality_conscientiousness <int>, reported_sleep_quality <int>

# Export the dataset for reuse in the challenges
write_csv(reels_data, "reels_data.csv")
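As an aside, since the challenges reuse this file: in a real-world session you could load it back in with read_csv(), roughly like this (a sketch that assumes reels_data.csv sits in your working directory):

```r
library(readr)  # part of the tidyverse; provides read_csv()

# Reload the exported dataset as a tibble
reels_data <- read_csv("reels_data.csv")
```

read_csv() will print a short message about the column types it guessed; that message is informational, not an error.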

4. The Core Verbs of Data Wrangling

Let’s learn the grammar. Before the verbs themselves, we need the glue that connects them: the native R pipe operator, |>.

The “Then” Operator: The Native R Pipe (|>)

Think of the pipe (|>) as the word “then”. It takes the output of the step before it and feeds it as the first argument to the step after it.

This pipe is built directly into R (since version 4.1.0), so you don’t need any special packages to use it.

Without the pipe, code looks nested and hard to read: print(head(reels_data, 5))

With the pipe, code reads like a sentence: reels_data |> head(5) |> print() (Read as: “Take the reels_data, then take the head of it with 5 rows, then print it.”)

We will use the pipe |> to connect all our dplyr verbs.
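To see the two styles side by side before we touch any real data, here is a tiny illustration using only base R functions (the vector is just made-up example numbers):

```r
# Nested style: you have to read it from the inside out
round(sqrt(sum(c(1, 4, 9, 16))), 1)
# [1] 5.5

# Piped style: reads left to right, like a sentence
# "Take the vector, THEN sum it, THEN take the square root, THEN round it"
c(1, 4, 9, 16) |> sum() |> sqrt() |> round(1)
# [1] 5.5
```

Both lines do exactly the same computation; the pipe just feeds each result in as the first argument of the next call.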


Verb 1: select() - Choosing Columns

What it does: Subsets your data by picking specific columns.

Excel Analogy: This is like right-clicking and “Hiding” or “Unhiding” columns in Excel. You could also think of it as creating a new sheet and only copy-pasting the columns you want to keep.

Our Question: We are interested in personality. Let’s just look at the user_id and the two personality columns.

# --- Using select() ---

# We take our data, THEN...
personality_data <- reels_data |>
  # ...we select *only* these three columns
  select(user_id, personality_extraversion, personality_conscientiousness)

# Let's see the result
print(personality_data)
# A tibble: 200 × 3
   user_id personality_extraversion personality_conscientiousness
   <chr>                      <int>                         <int>
 1 user_1                         1                             7
 2 user_2                         9                             5
 3 user_3                         3                             8
 4 user_4                         9                             1
 5 user_5                         2                             7
 6 user_6                         4                            10
 7 user_7                         8                             7
 8 user_8                         4                             5
 9 user_9                         2                             1
10 user_10                        6                             7
# ℹ 190 more rows

Bonus Tip: You can also use select() to remove columns using the - (minus) sign.

# This gives us all columns *except* stress_level and anxiety_level
no_wellness_data <- reels_data |>
  select(-stress_level, -anxiety_level)

# print(no_wellness_data)

Verb 2: filter() - Subsetting Rows

What it does: Picks rows based on a logical condition.

Excel Analogy: This is the Filter button in Excel (the little funnel icon). You’re clicking the dropdown on a column header and checking boxes or setting a custom filter (e.g., ‘Number Filters’ > ‘Greater Than…’).

Our Question: Who are our “power users”? Let’s find everyone who uses Reels for more than 20 hours per week.

# --- Using filter() ---

# We take our data, THEN...
power_users <- reels_data |>
  # ...we filter for rows where this condition is TRUE
  filter(reels_hours_per_week > 20)

print(power_users)
# A tibble: 0 × 9
# ℹ 9 variables: user_id <chr>, age <int>, gender <chr>,
#   reels_hours_per_week <dbl>, stress_level <int>, anxiety_level <int>,
#   personality_extraversion <int>, personality_conscientiousness <int>,
#   reported_sleep_quality <int>

Interesting: zero rows! Nobody in our synthetic dataset tops 20 hours of Reels per week. You can also use multiple conditions. Let’s find users who are under 25 AND have a stress level above 7.

# We can use comma (,) or the & symbol for AND
young_and_stressed <- reels_data |>
  filter(
    age < 25, 
    stress_level > 7
  )
  
print(young_and_stressed)
# A tibble: 6 × 9
  user_id    age gender reels_hours_per_week stress_level anxiety_level
  <chr>    <int> <chr>                 <dbl>        <int>         <int>
1 user_71     21 Male                   13.2           10             4
2 user_123    23 Female                 10.1            8             6
3 user_135    24 Female                 14.6            8             6
4 user_160    19 Female                 14.3           10             7
5 user_178    24 Male                    7.9            9             6
6 user_179    24 Female                  4.9            9             3
# ℹ 3 more variables: personality_extraversion <int>,
#   personality_conscientiousness <int>, reported_sleep_quality <int>

Verb 3: mutate() - Creating New Columns

What it does: Creates new columns (or changes existing ones) based on other columns.

Excel Analogy: This is like adding a new, blank column in Excel and writing a formula that refers to other cells in the same row. For example, writing =IF(D2 < 5, "Low", IF(D2 <= 15, "Medium", "High")) in a new column E and dragging that formula all the way down.

Our Question: The “hours” column is too specific. Let’s create a usage_category (‘Low’, ‘Medium’, ‘High’) to make it easier to analyze.

We’ll use a helpful function called case_when() inside our mutate() call. case_when() checks its conditions in order and assigns the label for the first condition that is TRUE.

# --- Using mutate() ---

# We take our data, THEN...
reels_data_with_categories <- reels_data |>
  # ...we create a new column called 'usage_category'
  mutate(
    usage_category = case_when(
      reels_hours_per_week < 5 ~ "Low",       # If hours < 5, label "Low"
      reels_hours_per_week <= 15 ~ "Medium",  # Otherwise, if hours is 5 to 15
      reels_hours_per_week > 15 ~ "High"      # Otherwise, hours > 15
    )
  )

# Print the result, but let's select just a few columns to see
reels_data_with_categories |>
  select(user_id, reels_hours_per_week, usage_category)
# A tibble: 200 × 3
   user_id reels_hours_per_week usage_category
   <chr>                  <dbl> <chr>         
 1 user_1                   8.2 Medium        
 2 user_2                  11.7 Medium        
 3 user_3                  10.3 Medium        
 4 user_4                  18.3 High          
 5 user_5                   0   Low           
 6 user_6                   0.3 Low           
 7 user_7                   6.3 Medium        
 8 user_8                  15.4 High          
 9 user_9                   7.7 Medium        
10 user_10                  7.9 Medium        
# ℹ 190 more rows

Verb 4: arrange() - Sorting Your Data

What it does: Re-orders the rows of your data based on one or more columns.

Excel Analogy: This is the Sort button in Excel. You select a column and choose ‘Sort A to Z’ (ascending) or ‘Sort Z to A’ (descending).

Our Question: Who are the top 5 most anxious users in our dataset?

# --- Using arrange() ---
# By default, arrange() sorts in ASCENDING order (A-Z, 1-10)

# To sort in DESCENDING order (10-1, Z-A), we use the desc() helper
most_anxious_users <- reels_data |>
  arrange(desc(anxiety_level)) # Sort by anxiety_level, highest to lowest

# Let's just look at the top 5
most_anxious_users |>
  select(user_id, anxiety_level, stress_level) |>
  head(5)
# A tibble: 5 × 3
  user_id anxiety_level stress_level
  <chr>           <int>        <int>
1 user_4             10            3
2 user_9             10            3
3 user_26            10            5
4 user_27            10            2
5 user_48            10            9

Verbs 5 & 6: group_by() and summarise() - The Dynamic Duo

This is the most powerful combination in dplyr.

Excel Analogy: This is the Pivot Table in Excel.

  • group_by(gender) is exactly like dragging the gender field into the ‘Rows’ area of a Pivot Table.
  • summarise(...) is like dragging reels_hours_per_week and stress_level into the ‘Values’ area and setting them to ‘Average’.


Thinking it Through: Our “Mind Map” for Grouping and Summarising

Our Question: What is the average Reels usage and average stress level for each gender group?
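The rendered mind map itself isn’t reproduced here, but since we loaded DiagrammeR in the setup, one possible sketch of this plan as a grViz() flowchart might look like the following (the node labels are illustrative, not the original diagram):

```r
library(DiagrammeR)

# A small left-to-right flowchart of the plan:
# data -> group by gender -> summarise each group -> summary table
grViz('
  digraph grouping_plan {
    rankdir = LR
    node [shape = box, fontname = Helvetica]
    "reels_data" -> "group_by(gender)" -> "summarise(mean, n())" -> "summary table"
  }
')
```

Running this in RStudio (or a knitted document) renders the flowchart in the Viewer pane.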

# --- Using group_by() and summarise() ---

# We take our data, THEN...
summary_by_gender <- reels_data |>
  # ...we create groups for each unique value in 'gender'
  group_by(gender) |>
  # ...THEN, we collapse each group into summary statistics
  summarise(
    # We create a new column 'average_reels_usage'
    average_reels_usage = mean(reels_hours_per_week),
    # We create 'average_stress'
    average_stress = mean(stress_level),
    # n() is a special function that counts the number of rows in each group
    number_of_users = n()
  )

print(summary_by_gender)
# A tibble: 3 × 4
  gender     average_reels_usage average_stress number_of_users
  <chr>                    <dbl>          <dbl>           <int>
1 Female                    9.29           5.63              97
2 Male                      9.09           5.81              83
3 Non-binary                8.28           5.95              20

Look at that! In just a few lines of code, we have a clear, aggregated table answering our question. This is the magic of dplyr, and it’s the same magic behind Pivot Tables!

5. Demo: Putting It All Together

Now, let’s combine our verbs into one “sentence” to answer a more complex question.


Thinking it Through: Our “Mind Map” for the Final Analysis

Our Final Problem: “Find the average reels_hours_per_week and anxiety_level for users who are ‘highly extraverted’ (Extraversion > 7). We only want to see results for the ‘Male’ and ‘Female’ groups, and the final table should be sorted to show which group has the highest average Reels usage.”


Here is the entire chain of logic:

# --- The Final Chain ---

final_analysis <- reels_data |>
  
  # 1. First, FILTER to get only the users we care about
  filter(
    personality_extraversion > 7,  # Must be highly extraverted
    gender %in% c("Male", "Female")  # %in% is a helper for 'is one of these'
  ) |>
  
  # 2. THEN, GROUP by the 'gender' column
  group_by(gender) |>
  
  # 3. THEN, SUMMARISE each group
  summarise(
    avg_reels_usage = mean(reels_hours_per_week),
    avg_anxiety = mean(anxiety_level),
    count = n()
  ) |>
  
  # 4. FINALLY, ARRANGE the resulting summary table
  arrange(desc(avg_reels_usage))

# Let's print our final answer!
print(final_analysis)
# A tibble: 2 × 4
  gender avg_reels_usage avg_anxiety count
  <chr>            <dbl>       <dbl> <int>
1 Male              10.8        5.2     20
2 Female            10.4        6.85    34

6. Conclusion

Today, you’ve learned the core “grammar” of data manipulation in R:

  • select() to pick columns (like Hiding columns)
  • filter() to pick rows (like the Filter button)
  • mutate() to create/change columns (like writing a formula)
  • arrange() to sort rows (like the Sort button)
  • group_by() & summarise() to aggregate data (like a Pivot Table)
  • And the |> native R pipe to connect them all together!

With these six verbs, you can solve an incredible range of data wrangling problems. Happy coding!

7. Your Turn: Challenges!

Now it’s time to test your skills. Try to complete the following code chunks to answer the questions. The ___ is a placeholder for the code you need to add.


Challenge 1: The Least Stressed

Question: Find the 3 least stressed users (stress_level) who are over 40 years old. Show only their user_id, age, and stress_level.

Hint: By default, arrange() sorts from lowest to highest (ascending).

# --- Challenge 1 Code ---
# Find the 3 least stressed users over 40
least_stressed_over_40 <- reels_data |>
  filter(age > ___) |>
  arrange(____) |>   # Sort by stress_level (lowest first)
  select(user_id, age, ____) |>
  head(3)

# After you fill in the blanks, uncomment the line below to see your answer
# print(least_stressed_over_40)

Challenge 2: Low-Stress Summary

Question: Create a summary that shows the average personality_extraversion and average reported_sleep_quality for each gender group. But, only include users who have a stress_level of 5 or less.

# --- Challenge 2 Code ---
# Average personality and sleep for low-stress users by gender
low_stress_summary <- reels_data |>
  filter(stress_level <= ___) |>
  group_by(____) |>
  summarise(
    avg_extraversion = mean(____),
    avg_sleep = mean(____)
  )

# After you fill in the blanks, uncomment the line below to see your answer
# print(low_stress_summary)

Challenge 3: Wellness Score

Question: First, create a new column wellness_score which is the sum of stress_level and anxiety_level. Then, find the average wellness_score for users in each gender group.

# --- Challenge 3 Code ---
# Average "wellness_score" by gender
wellness_by_gender <- reels_data |>
  mutate(wellness_score = ____ + ____) |>
  group_by(____) |>
  summarise(
    avg_wellness = mean(____)
  )

# After you fill in the blanks, uncomment the line below to see your answer
# print(wellness_by_gender)